cpu/drcbearm64.cpp: Optimise load/store and call generation #13307

cuavas · 2025-02-01T06:31:23Z

This should implement some of the optimisations previously discussed for AArch64 code generation:

The bl displacement is in words (4-byte units), not bytes
The emit_*_mem functions know the operand size, so they can pass the corresponding shift to emit_ldr_str_base_mem rather than trying to calculate it after the fact
An immediate load/store offset can be either a 9-bit signed byte offset or an unsigned 12-bit element offset
An unsigned 12-bit element offset can always reach an entire page, so the page-relative access can always be done in two instructions
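To make the offset rules in the list above concrete, here is a minimal sketch of how an immediate offset could be classified and how a bl target could be range-checked. The names `offset_form`, `classify_offset` and `bl_reachable` are hypothetical and are not the identifiers used in drcbearm64.cpp; the sketch also assumes integer accesses of at most eight bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not taken from drcbearm64.cpp.
enum class offset_form { scaled_unsigned12, unscaled_signed9, none };

// size_shift is log2 of the access size in bytes (0 = byte ... 3 = doubleword).
inline offset_form classify_offset(int64_t offset, unsigned size_shift)
{
    assert(size_shift <= 3); // assumption of this sketch: integer accesses only
    const int64_t mask = (int64_t(1) << size_shift) - 1;

    // ldr/str (unsigned offset): 12-bit field counted in elements,
    // so the byte offset must be non-negative and a multiple of the size
    if (offset >= 0 && (offset & mask) == 0 && (offset >> size_shift) < 4096)
        return offset_form::scaled_unsigned12;

    // ldur/stur: 9-bit signed byte offset with no alignment requirement
    if (offset >= -256 && offset <= 255)
        return offset_form::unscaled_signed9;

    // otherwise the offset has to be materialised in a register
    return offset_form::none;
}

// bl encodes a signed 26-bit displacement counted in 4-byte words.
inline bool bl_reachable(int64_t byte_displacement)
{
    if (byte_displacement % 4 != 0)
        return false;
    const int64_t words = byte_displacement / 4;
    return words >= -(int64_t(1) << 25) && words < (int64_t(1) << 25);
}
```

With these rules a doubleword access at byte offset 32760 still fits the scaled form (4095 elements), while an offset of -8 only fits the unscaled 9-bit form.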

And one bug fix:

AArch64 doesn’t allow a variable left shift for a register offset; it only allows a shift of zero or a shift by log2 of the element size, so there’s no point testing intermediate shift values.
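As a rough illustration of that constraint, the check for whether an index shift can be folded into the register-offset addressing mode reduces to a two-way comparison. The helper name `reg_offset_shift_encodable` is made up for this sketch and does not come from the PR.

```cpp
#include <cassert>

// Hypothetical helper, not from drcbearm64.cpp.
// shift:      left shift the caller wants applied to the index register
// size_shift: log2 of the access size in bytes (0 = byte ... 3 = doubleword)
inline bool reg_offset_shift_encodable(unsigned shift, unsigned size_shift)
{
    assert(size_shift <= 3); // assumption of this sketch: integer accesses only

    // ldr/str (register) can apply either no shift or a shift that scales
    // the index by the element size; nothing in between can be encoded
    return shift == 0 || shift == size_shift;
}
```

For a word access (`size_shift` of 2) only shifts of 0 and 2 pass; a shift of 1 or 3 would need a separate instruction to compute the address.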

@987123879113 and/or @rb6502 can you check this out and test it?

cuavas · 2025-02-01T07:25:59Z

I just realised the 12-bit unsigned offset can only reach an entire page for aligned accesses. Hopefully the vast majority of accesses are aligned anyway.
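A small sketch of the alignment condition, assuming the page-relative access is an adrp to form the 4 KiB page base followed by a load/store that uses the low 12 bits of the address as a scaled unsigned offset. The function name `page_offset_fits_scaled` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not from drcbearm64.cpp.
// Returns true if the low 12 bits of the address can be used directly as the
// scaled unsigned offset of a load/store that follows an adrp.
// size_shift is log2 of the access size in bytes (0 = byte ... 3 = doubleword).
inline bool page_offset_fits_scaled(uint64_t address, unsigned size_shift)
{
    assert(size_shift <= 3); // assumption of this sketch: integer accesses only
    const uint64_t page_offset = address & 0xfff;
    const uint64_t align_mask = (uint64_t(1) << size_shift) - 1;

    // the offset field counts elements, so the byte offset within the page
    // must be a multiple of the access size
    return (page_offset & align_mask) == 0;
}
```

For example, a doubleword at an address ending in 0x678 qualifies, while one ending in 0x67c does not, although a word access at the same address does.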

There’s still one form of ldr we aren’t trying to use – the PC-relative form with a 19-bit signed displacement in words (±1MB reach). It can only be used to load a word or doubleword (not a byte or halfword, or a floating-point type), and there’s no equivalent str form.
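For reference, the reach of that PC-relative form can be checked as in the sketch below. The function name `ldr_literal_reachable` is hypothetical, and the sketch assumes the displacement is measured from the address of the ldr instruction itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not from drcbearm64.cpp.
// pc:     address of the ldr (literal) instruction
// target: address of the data to load
inline bool ldr_literal_reachable(uint64_t pc, uint64_t target)
{
    const int64_t disp = int64_t(target - pc);

    // the displacement is encoded in 4-byte words, so it must be word-aligned
    if (disp % 4 != 0)
        return false;

    // 19-bit signed word count: -2^18 .. 2^18 - 1 words, i.e. about +/-1MB
    const int64_t words = disp / 4;
    return words >= -(int64_t(1) << 18) && words < (int64_t(1) << 18);
}
```

The range is asymmetric by one word: a target exactly 1MB below the instruction is reachable, but one exactly 1MB above it is not.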

987123879113 · 2025-02-01T08:42:43Z

Code looks fine to me. I ran it through a few games + the tester and no issues from what I can tell.

rb6502 · 2025-02-01T20:27:44Z

A few before/after -str 90 -nothrottle runs:

game (CPU)            before     after
shienryu (SH2)       414.39%   406.62%
s1945iii (SH2)      1319.27%  1296.48%
toyfight (SH4)       153.04%   168.68%
dc (SH4)             139.08%   142.06% (w/software list revilcv)
calspeed (MIPS)      332.66%   539.99%
kinst2 (MIPS)        729.66%   792.97%
gradius4 (PPC)       506.45%   501.04%
scud (PPC)           103.09%   103.20%

Very beneficial for MIPS, pretty much a wash otherwise.

cuavas · 2025-02-01T20:37:09Z

Those MIPS results are kind of insane. Is it doing an inordinate number of fastram accesses or something?

rb6502 · 2025-02-01T20:55:34Z

Not sure what it was doing. Tried it again just now and the new code is still faster, but it's a much more normal delta.

cuavas added 2 commits February 1, 2025 17:26

cpu/drcbearm64.cpp: Optimise load/store and call generation.

a21e0d7

cpu/drcbearm64.cpp: 12-bit offset can only reach an entire page for aligned accesses.

d9221b2

cuavas merged commit cdfb07c into mamedev:master Feb 1, 2025
5 checks passed

cuavas deleted the a64offsets branch February 1, 2025 20:36
